Lab Objective

The following study will explore Bike Chattanooga’s usage patterns, weather patterns, and local context factors from March 01 to May 31, 2022 to develop a model that can potentially inform an effective rebalancing strategy.

Background

When Bike Chattanooga launched in 2012, it was the first city in the Southeastern US to offer a bikeshare program. Since then, over 600,000 trips have been taken on the Bike Chattanooga system - not a bad stat for a city of 180,000 residents. The city has 400 rental bicycles, including 55 E-Bikes, that are rented and may be returned at 42 locations around the municipality. With new grant money from the Tennessee DOT and private philanthropy, and demonstrated growth year over year, expanded e-bike stations and bike routes are in the pipeline. (source)

Urban bikeshare programs have evolved as a response to the increasing demand for sustainable and efficient urban transportation. Originating in the mid-20th century in Europe, early bikeshare systems gained popularity in the 21st century with the advent of technology and increased environmental consciousness. One of the pioneering systems was Lyon’s Vélo’v in 2005, setting the stage for global adoption. Subsequently, major cities worldwide introduced their own bikeshare programs. These initiatives typically operate through a network of strategically placed stations where users can rent bicycles for short trips. Users access bikes through mobile apps or membership cards, unlocking them at one station and returning them at another. The pay-per-ride or subscription-based model encourages short-distance commuting, contributing to reduced traffic congestion and environmental impact while promoting a healthier lifestyle.

The success of urban bikeshare programs hinges on their accessibility, ease of use, and integration with public transit systems. Rebalancing is a critical logistics hurdle for bikeshare program’s reliability and success. Since many cities experience bikeshare ‘rush hours’ it is often the case that bikes are not where riders need them to be. Rebalancing involves the redistribution of bicycles across stations to address spatial and temporal variations in user demand. Currently, many cities employ dedicated teams of rebalancing staff who manually transport bikes between stations based on real-time usage patterns and predictive models. However, the manual nature of this process poses logistical challenges, including high operational costs, inefficiencies, and delays in response to sudden demand fluctuations. The challenge lies in maintaining a delicate equilibrium in bike distribution to optimize user experience, making it imperative for bikeshare programs to explore more efficient and data-driven rebalancing strategies.

Methods

This study employs three primary datasets:

  1. Bike Chattanooga station and ridership data from Spring 2022
  2. weather data collected at Chattanooga Metropolitan Airport during those same dates and accessed via the riem package in R
  3. 2021 5-Y American Community Survey data for Hamilton County, TN.
  4. Open Streets Maps data about highway and waterway locations.

Data Wrangling

Wrangle the data to be operable with the lubridate R package.

dat <- st_read("/Users/LauraFrances_1/Library/CloudStorage/GoogleDrive-lauramuellersoppart@gmail.com/My Drive/Penn/Courses/MUSA508_PPA/PPA_Assignment5/BikeShare/Bike_Chattanooga_Trip_Data_Mar22_May22.csv")

dat$Start.Time <- gsub("/", "-", dat$Start.Time)
dat$Start.Time <- gsub(" AM$", " AM", dat$Start.Time)
dat$Start.Time <- gsub(" PM$", " PM", dat$Start.Time)
dat$End.Time <- gsub("/", "-", dat$End.Time)
dat$End.Time <- gsub(" AM$", " AM", dat$End.Time)
dat$End.Time <- gsub(" PM$", " PM", dat$End.Time)

# Convert to 24-hour clock
dat$Start.Time <- mdy_hms(dat$Start.Time)

# Format the result in 24-hour clock
dat$Start.Time <- format(dat$Start.Time, "%Y-%m-%d %H:%M:%S")

#Parse Out from Lat Long 
dat$from_longitude <- as.numeric(gsub("POINT \\((-?\\d+\\.\\d+) \\d+\\.\\d+\\)", "\\1", dat$Start.Location))
dat$from_latitude <- as.numeric(gsub("POINT \\(-?\\d+\\.\\d+ (-?\\d+\\.\\d+)\\)", "\\1", dat$Start.Location))
dat$to_longitude <- as.numeric(gsub("POINT \\((-?\\d+\\.\\d+) \\d+\\.\\d+\\)", "\\1", dat$End.Location))
dat$to_latitude <- as.numeric(gsub("POINT \\(-?\\d+\\.\\d+ (-?\\d+\\.\\d+)\\)", "\\1", dat$End.Location))

#Create time bins 
dat2 <- dat %>%
  mutate(interval60 = floor_date(ymd_hms(Start.Time), unit = "hour"), #downside is that if it's 7:59, itll be 7:00 
         interval15 = floor_date(ymd_hms(Start.Time), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))

Load in 2021 5Y American Community Survey Census Data for Hamilton County. We evaluate the following variables: Total population, median household income, median age, white vs non-white populations, percent public transit commuters, and median commute time.

chatCensus <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2021, 
          state = 47, 
          county= 065,
          geometry = TRUE, 
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

#extract geometries 

city_limits <- st_read("/Users/LauraFrances_1/Library/CloudStorage/GoogleDrive-lauramuellersoppart@gmail.com/My Drive/Penn/Courses/MUSA508_PPA/PPA_Assignment5/BikeShare/Chattanooga City Limits/geo_export_48f449e6-f033-4fc7-afa3-be73bce74c1d.shp") %>% 
  st_transform(crs=st_crs(chatCensus)) 

city_limits <- st_make_valid(city_limits)
chatCensus <- st_make_valid(chatCensus)

# Perform the intersection
intersected_chatTracts <- st_intersection(chatCensus, city_limits)

# Extract geometries for the intersected tracts
chatTracts <- 
  intersected_chatTracts %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf

#add census tracts 
dat_census <- st_join(dat2 %>% 
          filter(is.na(from_longitude) == FALSE &
                   is.na(from_latitude) == FALSE &
                   is.na(to_latitude) == FALSE &
                   is.na(to_longitude) == FALSE) %>%
          st_as_sf(., coords = c("from_longitude", "from_latitude"), crs = 4326),
        chatTracts %>%
          st_transform(crs=4326),
        join=st_intersects,
              left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(from_longitude = unlist(map(geometry, 1)),
         from_latitude = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)%>%
  st_as_sf(., coords = c("to_longitude", "to_latitude"), crs = 4326) %>%
  st_join(., chatTracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(to_longitude = unlist(map(geometry, 1)),
         to_latitude = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)

Load in weather data from Chattanooga Metropolitan Airport’s weather station for March 01 to May 31, 2022. Chattanooga has a distinct spring season where by mid-April precipitation falls and temperatures steadily rise.

weather.Panel <- 
  riem_measures(station = "KCHA", date_start = "2022-03-01", date_end = "2022-05-31") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
  top="Weather Data - Chattanooga KCHA - Spring 2022")

# # Create a trend line for weekly rides to add on top of weather 
# trend_line <- ggplot(dat_census %>%
#                        group_by(week) %>%
#                        tally(),
#                      aes(x = week, y = n)) +
#   geom_line(size = 0.2)

Exploratory Analysis

Bike Chattanooga has 42 stations that are concentrated in the Downtown and North Chattanooga neighborhoods.

highways <- opq("Chattanooga, TN") %>% 
  add_osm_feature(key = "highway", value = c("motorway", "motorway_link", "trunk", "primary")) %>% 
  osmdata_sf()

rivers <- opq("Chattanooga, TN") %>% 
  add_osm_feature(key = 'waterway', value = "river") %>%
  osmdata_sf()

ggplot() +
  geom_sf(data = chatTracts %>% st_transform(crs = 4326), fill = "darkgray", color = "gray90") +
  geom_sf(data = highways$osm_lines, color = "gray30", size = 0.3, alpha = 0.5) +
  geom_sf(data = rivers$osm_polygons, color = "lightblue", alpha = 0.5) +
  geom_point(data = dat2, aes(x = from_longitude, y = from_latitude), 
             color = "blue", alpha = 0.7, size = 0.3) +
  labs(title = "Distribution of Bikeshare Stations in Chattanooga",
       subtitle = "March - May 2022 | Dark gray lines are major motorways | Light gray lines are census tracts")+ 
  mapTheme

# #station by rides - tried to show station popularity
# ggplot()+
#   geom_sf(data = chatTracts %>% st_transform(crs = 4326), fill = "darkgray", color = "gray90") +
#   geom_point(data = dat_census %>%
#               group_by(Start.Station.ID) %>%
#               summarize(avg_weekly_rides = mean(n)),
#             aes(x = from_longitude, y = from_latitude, color = avg_weekly_rides, size = avg_weekly_rides),
#             alpha = 1) +
#   # scale_fill_gradient(low = "#D2FBD4", high = "#527D82", na.value = "transparent") +
#   scale_colour_viridis(direction = -1, discrete = FALSE, option = "G")+
#   ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
#   xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
#   labs(title="Avg Weekly Bike share trips by station",
#           subtitle = "March - May 2022 | Chattanooga, TN")+
#   mapTheme

Temporal Analysis

Examining our bike share data, we notice clear temporal trends: peaks during certain hours of the day and a longer-time decline in trips across the season. There is a spike of activity in early March during the Erlanger Chattanooga Marathon weekend. Generally, weekend days are more popular for riding Bike Chattanooga than weekday days.

ggplot(dat_census %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n), size=0.2)+
  labs(title="Bike share trips per hour",
       subtitle = "March - May 2022 | Chattanooga, TN",
       x="Date", 
       y="Number of trips")+
  plotTheme

Breaking these trends down to the, we notice peaks at certain times of day and certain days of the week. Overall ridership is higher on the weekend, and peaks a few hours earlier compared to the weekday’s after work rush. Generally, afternoon rush hours are associated with the highest ridership volumes during the week.

ggplot(dat_census %>% mutate(hour = hour(Start.Time)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
    scale_color_manual(values = c("#D74E09", "#bdd7e7","#6baed6","#3182bd","#08519c", "blue", "#6E0E0A"), name = "Day") + 
  labs(title="Bike share trips, by day of the week",
      subtitle = "March - May 2022 | Chattanooga, TN",
       x="Hour", 
       y="Trip Counts")+
     plotTheme

ggplot(dat_census %>% 
         mutate(hour = hour(Start.Time),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  scale_color_manual(values = c("#3182bd", "#D74E09")) +
  labs(title="Bike share trips - weekend vs weekday",
        subtitle = "March - May 2022 | Chattanooga, TN",
       x="Hour", 
       y="Trip Counts")+
     plotTheme

dat_census %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, Start.Station.Name, time_of_day) %>%
         tally()%>%
  group_by(Start.Station.Name, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips, fill=time_of_day), binwidth = 1)+
  scale_fill_manual(values = palette5, name="Time of Day") + 
  labs(title="Mean Number of Hourly Trips Per Station",
       subtitle = "March - May 2022 | Chattanooga, TN",
       x="Number of stations", 
       y="Trips Frequency")+
  facet_wrap(~time_of_day)+
  theme(legend.position = "none") +
  plotTheme

Spatial Analysis

Spatially, we observe that the highest ridership per station clusters in the Downtown and North Chattanooga areas, particularly in the mid-day and afternoon.

ggplot()+
  geom_sf(data = chatTracts %>% st_transform(crs = 4326), fill = "darkgray", color = "gray90") +
  geom_point(data = dat_census %>% 
            mutate(hour = hour(Start.Time),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              group_by(Start.Station.ID, from_latitude, from_longitude, weekend, time_of_day) %>%
              tally(),
            aes(x=from_longitude, y = from_latitude, color = n), 
            fill = "transparent", alpha = 1, size = 0.3)+
  # scale_fill_gradient(low = "#D2FBD4", high = "#527D82", na.value = "transparent") +
  scale_colour_viridis(direction = -1, discrete = FALSE, option = "G")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hr by station",
          subtitle = "March - May 2022 | Chattanooga, TN")+
  mapTheme

#mapping demographics in chattanooga to better understand the city's dynamics 

#chattanooga tracts
intersected_chatCensus <- st_intersection(chatCensus, chatTracts)

plot1 <- ggplot() +
  geom_sf(data = chatTracts %>% st_transform(crs = 4326), fill = "darkgray", color = "gray90") +
  geom_sf(data = intersected_chatCensus, aes(fill = Total_Pop), color = "white", size = 0.05) +
  geom_sf(data = highways$osm_lines, color = "gray30", size = 0.3, alpha = 0.5) +
  geom_sf(data = rivers$osm_polygons, color = "lightblue", alpha = 0.5) +
  geom_point(data = dat2, aes(x = from_longitude, y = from_latitude), 
             color = "blue", alpha = 0.7, size = 0.3) +
  scale_fill_gradient(low = "#D2FBD4", high = "#527D82", na.value = "transparent") +
  labs(title = "Tracts by Total Population", subtitle = "2022 | Chattanooga, TN") +
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  theme(legend.position = "bottom") +
  mapTheme

plot2 <- ggplot() +
  geom_sf(data = chatTracts %>% st_transform(crs = 4326), fill = "darkgray", color = "gray90") +
  geom_sf(data = intersected_chatCensus, aes(fill = Med_Inc), color = "white", size = 0.05) +
  geom_sf(data = highways$osm_lines, color = "gray30", size = 0.3, alpha = 0.5) +
  geom_sf(data = rivers$osm_polygons, color = "lightblue", alpha = 0.5) +
  geom_point(data = dat2, aes(x = from_longitude, y = from_latitude), 
             color = "blue", alpha = 0.7, size = 0.3) +
  scale_fill_gradient(low = "#D2FBD4", high = "#527D82", na.value = "transparent") +
  labs(title = "Tracts by Median Income", subtitle = "2022 | Chattanooga, TN") +
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  theme(legend.position = "bottom") +
  mapTheme


plot3 <- ggplot() +
  geom_sf(data = chatTracts %>% st_transform(crs = 4326), fill = "darkgray", color = "gray90") +
  geom_sf(data = intersected_chatCensus %>% mutate(Percent_Non_White = 1 - Percent_White), 
          aes(fill = Percent_Non_White), color = "white", size = 0.05) +
  geom_sf(data = highways$osm_lines, color = "gray30", size = 0.3, alpha = 0.5) +
  geom_sf(data = rivers$osm_polygons, color = "lightblue", alpha = 0.5) +
  geom_point(data = dat2, aes(x = from_longitude, y = from_latitude), 
             color = "blue", alpha = 0.7, size = 0.3) +
  scale_fill_gradient(low = "#D2FBD4", high = "#527D82", na.value = "transparent") +
  labs(title = "Tracts by Percent Non-White", subtitle = "2022 | Chattanooga, TN") +
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  theme(legend.position = "bottom") +
  mapTheme

# # # install.packages("patchwork")
# library(patchwork)
# plot1 + plot2 + plot3

Feature Engineering

To build our model, we conduct feature engineering, exploratory data analysis, and correlation analysis. The primary engineered feature are temporal lags which represent the number of trips in the previous hours. We find that the lags are strongly correlated with our dependent variable, Trip_Count, which provides strong insight into how to re-balance the Bike Chattanooga fleet.

Time-Space Panel

length(unique(dat_census$interval60)) * length(unique(dat_census$Start.Station.ID)) #69720 combinations

study.panel <- 
  expand.grid(interval60=unique(dat_census$interval60), #going to populate combinations 
              Start.Station.ID = unique(dat_census$Start.Station.ID)) %>% #summarize by hour and station
  left_join(., dat_census %>% #joins by station_id since its the only unique variable 
              select(Start.Station.ID, Start.Station.Name, Origin.Tract, from_longitude, from_latitude )%>%
              distinct() %>%
              group_by(Start.Station.ID) %>%
              slice(1))

ride.panel <- 
  dat_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, Start.Station.ID, Start.Station.Name, Origin.Tract, from_longitude, from_latitude) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel) %>%
  ungroup() %>%
  filter(is.na(Start.Station.ID) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)

ride.panel <- 
  left_join(ride.panel, chatCensus %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))

## Create Time Logs

#install.packages("spatialreg")
library(spatialreg)

ride.panel <- 
  ride.panel %>% 
  arrange(Start.Station.ID, interval60) %>%  
  mutate(lagHour = dplyr::lag(Trip_Count,1), #from an hour ago
         lag2Hours = dplyr::lag(Trip_Count,2), #from two hours ago
         lag3Hours = dplyr::lag(Trip_Count,3), #from three hours ago, etc. 
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24),
         holiday = ifelse(yday(interval60) == 148,1,0)) %>% #Memorial day - but when I change to Marathon weekend, Day 65, it breaks - not sure why
   mutate(day = yday(interval60)) %>%
   mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                 dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
                                 dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
                                 dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
                                 dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
                                 dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
         holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag)) 


#look at the correlation between the lags
as.data.frame(ride.panel) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                                "lag12Hours","lag1day")))%>%
    group_by(Variable) %>%  
    summarize(correlation = round(cor(Value, Trip_Count),2)) %>%
    kbl(caption = "Correlations between Engineered Features and Dependent Variable") %>%
    kable_paper(bootstrap_options = "striped", full_width = TRUE)

To provide a more intuitive visual representation of the bike share trips, we plot an animated map with temporal and spatial context. Additionally, we observe that stations are currently located in wealthier neighborhoods rather than the most populous areas.

stations<-
  dat2 %>% 
  dplyr::select(Start.Station.ID, from_longitude, from_latitude) %>%
  group_by(Start.Station.ID) %>%
  distinct()%>%
  filter(is.na(from_longitude) == FALSE &
           is.na(from_latitude) == FALSE) %>%
  st_as_sf(., coords = c("from_longitude", "from_latitude"), crs = 4326)


one_day <-
  dat_census %>%
  filter(week == 14 & dotw == "Thu")

one_day.panel <-
  expand.grid(
    interval60 = unique(one_day$interval60),
    Origin.Tract = unique(ride.panel$Origin.Tract))


animation_data <-
  mutate(one_day, Trip_Counter = 1) %>%
  right_join(one_day) %>% 
  group_by(interval60, Start.Station.ID) %>%
  summarise(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
  ungroup() %>% 
  right_join(stations, by=c("Start.Station.ID" = "Start.Station.ID")) %>% 
  st_sf() %>%
  mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                           Trip_Count == 1 ~ "1 trips",
                           Trip_Count == 2 ~ "2 trips",
                           Trip_Count > 2 & Trip_Count <= 4 ~ "3-4 trips",
                           Trip_Count > 4 ~ "5+ trips")) %>%
  mutate(Trips  = fct_relevel(Trips, "0 trips","1 trips","2 trips",
                              "3-4 trips","5+ trips"))


bike_animation <-
  ggplot() +
  geom_sf(data = chatCensus %>%
            st_transform(crs=4326))+
  geom_sf(data = chatCensus, aes(fill = Med_Inc), color = "white", size = 0.05) +
  geom_sf(data = animation_data, 
             aes(color = Trips), size = 2) +
  scale_color_manual(values = palette4) +
  scale_fill_gradient(low = "gray90", high = "gray30", na.value = "white") +  # Change the color palette
 ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
 xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  labs(title = "Bike Trips by Start Stations for the first week of April, 2022",
       subtitle = "60-minute intervals: {current_frame}") +
  transition_manual(interval60) +
  mapTheme

animate(bike_animation, duration=10, renderer = gifski_renderer())

Spatiotemporal Prediction

To conduct regression analysis and prediction on bike ride data, the data is split into training (ride.Train) and testing (ride.Test) sets based on the week variable. Several linear regression models (reg1 to reg6) are fitted using the ride.Train data and different combinations of predictor variables, including hour of the day, day of the week, temperature, precipitation, lags of trip counts, holiday information, and demographic variables.

When comparing Model 3 (Start.Station.Name + hour(interval60) + dotw + Temperature + Precipitation) and Model 6 (Model 3 + lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday + Total_Pop + Med_Inc + Percent_Taking_Public_Trans), Model 6 has a higher R² (0.227) compared to Model 3 (0.171) which suggests that Model 6 explains a larger proportion of the variance in the dependent variable compared to Model 3. Additionally, Model 6 has a slightly lower residual standard error (1.201) compared to Model 3 (1.243) which indicates a better fit of the model to the data. Lastly, while both models have a significant F statistic, Model 6 (F = 63.822) has a higher value compared to Model 3 (F = 55.888).

#splits data about in half 
ride.Train <- filter(ride.panel, week >= 20)
ride.Test <- filter(ride.panel, week < 20)

reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)

reg2 <- 
  lm(Trip_Count ~  Start.Station.Name + dotw + Temperature,  data=ride.Train)

reg3 <- 
  lm(Trip_Count ~  Start.Station.Name + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride.Train)

reg4 <- 
  lm(Trip_Count ~  Start.Station.Name +  hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, 
     data=ride.Train)

reg5 <- 
  lm(Trip_Count ~  Start.Station.Name + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday, 
     data=ride.Train)

reg6 <-
  lm(Trip_Count ~  Start.Station.Name + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday + 
                   Total_Pop + Med_Inc + Percent_Taking_Public_Trans, 
     data=ride.Train)
library(stargazer)
stargazer(reg3, reg6, type="text") 

We defined a function to predict trip counts for a given model and test dataset. For each model (reg1 to reg6), predictions are generated along with observed values, are gathered into a long format, and additional metrics like Absolute Error (AE), Mean Absolute Error (MAE), and standard deviation of AE.

#Predict for test data 
ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week) 

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)} #predict for fit using the dat dataframe 

week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred), #mapping the function for different models d
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
           ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred),
           FTime_Space_FE_timeLags_holidayLags_demo = map(.x = data, fit = reg6, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

Discussion

Examine Error Metrics for Accuracy

The plot illustrates the Mean Absolute Errors (MAE) for different model specifications across weeks. Notably, Models 4, 5, and 6 include time lags as a feature and consistently display lower MAE. While Model D has a slightly lower MAE, the enhanced statistical performance of Model F may justify its marginally higher MAE.

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette6) +
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme

The plot illustrates the predicted and observed time series of bike share trips for different model specifications across multiple weeks. The discrepancy between observed and predicted values, particularly around peak times, may arise from the inherent complexity of real-world bike share trip patterns. Factors such as external events, unforeseen fluctuations in demand, or unaccounted variables in the models may contribute to the differences. Generally, this comparative analysis shows how well each model aligns with the observed data, providing insights into the predictive performance of different model specifications.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           Start.Station.ID = map(data, pull, Start.Station.ID)) %>%
    dplyr::select(interval60, Start.Station.ID, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -Start.Station.ID) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed bike share time series", subtitle = "March - May 2022 | Chattanooga | A test set of 2 weeks",  x = "Hour", y= "Station Trips") +
      plotTheme

Spatially, we observe that there is higher MAE at stations proximate to the Tennessee River.

Alt text

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           Start.Station.ID = map(data, pull, Start.Station.ID), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude)) %>%
    select(interval60, Start.Station.ID, from_longitude, from_latitude, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "FTime_Space_FE_timeLags_holidayLags_demo") %>%
  group_by(Start.Station.ID, from_longitude, from_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = chatCensus, color = "grey", fill = "transparent")+
  geom_sf(data = highways$osm_lines, color = "gray30", size = 0.3, alpha = 0.5) +
  geom_sf(data = rivers$osm_polygons, color = "lightblue", alpha = 0.5) +
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             fill = "transparent", alpha = 0.8)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  labs(title="Mean Abs Error, Test Set, Model F", 
       subtitle = "Time_Space_FE_timeLags_holidayLags_demo")+
  mapTheme


# install.packages("mapview")
# library(mapview)
# 
# # Assuming you have an sf object for chatCensus
# chatCensus_sf <- st_as_sf(chatCensus)
# 
# # Your existing code
# result <- week_predictions %>% 
#   mutate(interval60 = map(data, pull, interval60),
#          Start.Station.ID = map(data, pull, Start.Station.ID), 
#          from_latitude = map(data, pull, from_latitude), 
#          from_longitude = map(data, pull, from_longitude)) %>%
#   select(interval60, Start.Station.ID, from_longitude, from_latitude, Observed, Prediction, Regression) %>%
#   unnest() %>%
#   filter(Regression == "FTime_Space_FE_timeLags_holidayLags_demo") %>%
#   group_by(Start.Station.ID, from_longitude, from_latitude) %>%
#   summarize(MAE = mean(abs(Observed - Prediction), na.rm = TRUE))
# 
# # Creating an sf object for the results
# result_sf <- st_as_sf(result, coords = c("from_longitude", "from_latitude"))
# 
# # Plotting with mapview
# mapview(chatCensus_sf) +
#   mapview(highways$osm_lines, col.regions = "gray30", alpha.regions = 0.5) +
#   mapview(result_sf, zcol = "MAE", alpha = 0.8, legend = TRUE)

Space-Time Error Evaluation

The scatterplots depict the relationship between observed and predicted bike share trips, with each point representing a specific time period and station. While the plots show a generally positive correlation, the models undercount the number of predicted bike trips.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           Start.Station.ID = map(data, pull, Start.Station.ID), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, Start.Station.ID, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "blue")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme

When comparing the the relationship between Mean Absolute Error (MAE) and three socio-economic variables (Percent_Taking_Public_Trans, Med_Inc, Percent_White) during the “AM Rush” versus the “PM Rush”, the “AM Rush” suggests a stronger, negative relationship bewteen errors and those who take public transportation to work.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           Start.Station.ID = map(data, pull, Start.Station.ID), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw),
           Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
           Med_Inc = map(data, pull, Med_Inc),
           Percent_White = map(data, pull, Percent_White)) %>%
    select(interval60, Start.Station.ID, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(Start.Station.ID, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-Start.Station.ID, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  #geom_sf(data = chatCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables - AM RUSH",
       y="Mean Absolute Error (Trips)")+
  plotTheme

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           Start.Station.ID = map(data, pull, Start.Station.ID), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw),
           Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
           Med_Inc = map(data, pull, Med_Inc),
           Percent_White = map(data, pull, Percent_White)) %>%
    select(interval60, Start.Station.ID, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "PM Rush") %>%
  group_by(Start.Station.ID, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-Start.Station.ID, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  #geom_sf(data = chatCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables - PM RUSH",
       y="Mean Absolute Error (Trips)")+
  plotTheme

Results

Cross-validation is performed across time, rather than space, to assess the performance and generalizability of a predictive model. The results of the model indicate a Mean Absolute Error (MAE) of approximately 0.646 with a standard deviation of around 0.037. A lower MAE is generally desirable, indicating better predictive performance. The standard deviation of the MAE gives an idea of the variability of prediction errors across different folds of the cross-validation process.

control <- trainControl(method="cv", number=100)

set.seed(42)

model_cv <- train(Trip_Count ~ Start.Station.ID + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday + 
                   Total_Pop + Med_Inc + Percent_Taking_Public_Trans, 
                  data=na.omit(ride.panel),
                  method="lm",
                  trControl=control)

model_cv$resample %>% 
  summarise(MAE = mean(model_cv$resample[,3]),
            sd(model_cv$resample[,3])
) %>%
  kbl(col.name=c('Mean Absolute Error','Standard Deviation of MAE')) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Mean Absolute Error Standard Deviation of MAE
0.6456238 0.0367577

Enhancing the model’s performance relies on two main factors: refining feature engineering and increasing computational resources. Incorporating binary variables to account for factors like rush hours may improve accuracy. However, the most significant improvement would likely come from expanding the dataset, enabling the model to discern nuanced patterns, such as variations in commuting behavior on different holidays, such as Marathon Sunday.Additionally, transitioning from ordinary least squares (OLS) regression to models explicitly addressing spatial dependencies, such as spatial lag regression or geographically-weighted regression, particularly around the city’s highways and waterways. Ultimately, maximizing the dataset size would likely prove beneficial. Training the model on a more extensive historical dataset, encompassing the approximately 11 years of available Bike Chattanooga data, would likely yield substantial improvements in accuracy and generalizability.

Conclusion

Our model offers valuable insights for optimizing Bike-Chattanooga’s rebalancing efforts. Its overall accuracy surpasses a naive approach that reacts to emerging demand. However, challenges persist, particularly in handling the non-random spatial distribution of errors around the Tennessee River. Notably, the model struggles to generalize well in high-demand areas, where the consequences of underprediction are most pronounced. While the model remains a robust predictive tool, there are notable opportunities for improvement, especially in enhancing its treatment of spatial fixed effects

---
title: "Space-Time Prediction of Bike Share Demand - Chattanooga, TN"
author: "Laura Frances, University of Pennsylvania"
date: "November 12, 2023"
output: 
  html_document:
    toc: true
    toc_float: true
    code_folding: "hide"
    code_download: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

# install.packages("lubridate")
# install.packages("dyplr")
# install.packages("tidyverse")
# install.packages("ggplot2")
# install.packages("sf")
# install.packages("tidyr")
# install.packages("tigris")
# install.packages("tidycensus")
# install.packages("gridExtra")
# install.packages("knitr")
# install.packages("kableExtra")
# install.packages("RSocrata")
# install.packages("viridis")
# install.packages("viridisLite")
# install.packages("riem")
#install.packages("osmdata")
#install.packages("gganimate")
#install.packages("caret")

library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(osmdata)
library(gganimate)
library(caret)

options(scipen=999) #no scientific notation
options(tigris_class = "sf") #whenever we get tigris, we want sf 
options(tigris_use_cache = TRUE) 

census_api_key("c49a9145a5cad600a8b6393d31dbed862a0211a1", overwrite = TRUE, install = TRUE)

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.2),
  axis.ticks=element_blank())

mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

palette6 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c", "blue")
palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")

```

# **Lab Objective**

The following study will explore Bike Chattanooga's usage patterns, weather patterns, and local context factors from March 01 to May 31, 2022 to develop a model that can potentially inform an effective rebalancing strategy. 

# **Background**

When Bike Chattanooga launched in 2012, it was the first city in the Southeastern US to offer a bikeshare program. Since then, over 600,000 trips have been taken on the Bike Chattanooga system - not a bad stat for a city of 180,000 residents. The city has 400 rental bicycles, including 55 E-Bikes, that are rented and may be returned at 42 locations around the municipality. With new grant money from the Tennessee DOT and private philanthropy, and demonstrated growth year over year, expanded e-bike stations and bike routes are in the pipeline. [(source)](https://www.timesfreepress.com/news/2022/jul/28/bike-chattanoogboosts-ridership-over-10-years/) 

Urban bikeshare programs have evolved as a response to the increasing demand for sustainable and efficient urban transportation. Originating in the mid-20th century in Europe, early bikeshare systems gained popularity in the 21st century with the advent of technology and increased environmental consciousness. One of the pioneering systems was Lyon's Vélo'v in 2005, setting the stage for global adoption. Subsequently, major cities worldwide introduced their own bikeshare programs. These initiatives typically operate through a network of strategically placed stations where users can rent bicycles for short trips. Users access bikes through mobile apps or membership cards, unlocking them at one station and returning them at another. The pay-per-ride or subscription-based model encourages short-distance commuting, contributing to reduced traffic congestion and environmental impact while promoting a healthier lifestyle. 

The success of urban bikeshare programs hinges on their accessibility, ease of use, and integration with public transit systems. Rebalancing is a critical logistics hurdle for bikeshare program's reliability and success. Since many cities experience bikeshare 'rush hours' it is often the case that bikes are not where riders need them to be. Rebalancing involves the redistribution of bicycles across stations to address spatial and temporal variations in user demand. Currently, many cities employ dedicated teams of rebalancing staff who manually transport bikes between stations based on real-time usage patterns and predictive models. However, the manual nature of this process poses logistical challenges, including high operational costs, inefficiencies, and delays in response to sudden demand fluctuations. The challenge lies in maintaining a delicate equilibrium in bike distribution to optimize user experience, making it imperative for bikeshare programs to explore more efficient and data-driven rebalancing strategies.

# **Methods**

This study employs three primary datasets: 

1) Bike Chattanooga station and ridership data from Spring 2022
2) weather data collected at Chattanooga Metropolitan Airport during those same dates and accessed via the riem package in R
3) 2021 5-Y American Community Survey data for Hamilton County, TN. 
4) Open Streets Maps data about highway and waterway locations.

## Data Wrangling

Wrangle the data to be operable with the lubridate R package.

```{r echo=TRUE, warning=FALSE, message=FALSE, results='hide'}
dat <- st_read("/Users/LauraFrances_1/Library/CloudStorage/GoogleDrive-lauramuellersoppart@gmail.com/My Drive/Penn/Courses/MUSA508_PPA/PPA_Assignment5/BikeShare/Bike_Chattanooga_Trip_Data_Mar22_May22.csv")

dat$Start.Time <- gsub("/", "-", dat$Start.Time)
dat$Start.Time <- gsub(" AM$", " AM", dat$Start.Time)
dat$Start.Time <- gsub(" PM$", " PM", dat$Start.Time)
dat$End.Time <- gsub("/", "-", dat$End.Time)
dat$End.Time <- gsub(" AM$", " AM", dat$End.Time)
dat$End.Time <- gsub(" PM$", " PM", dat$End.Time)

# Convert to 24-hour clock
dat$Start.Time <- mdy_hms(dat$Start.Time)

# Format the result in 24-hour clock
dat$Start.Time <- format(dat$Start.Time, "%Y-%m-%d %H:%M:%S")

#Parse Out from Lat Long 
dat$from_longitude <- as.numeric(gsub("POINT \\((-?\\d+\\.\\d+) \\d+\\.\\d+\\)", "\\1", dat$Start.Location))
dat$from_latitude <- as.numeric(gsub("POINT \\(-?\\d+\\.\\d+ (-?\\d+\\.\\d+)\\)", "\\1", dat$Start.Location))
dat$to_longitude <- as.numeric(gsub("POINT \\((-?\\d+\\.\\d+) \\d+\\.\\d+\\)", "\\1", dat$End.Location))
dat$to_latitude <- as.numeric(gsub("POINT \\(-?\\d+\\.\\d+ (-?\\d+\\.\\d+)\\)", "\\1", dat$End.Location))

#Create time bins 
dat2 <- dat %>%
  mutate(interval60 = floor_date(ymd_hms(Start.Time), unit = "hour"), #downside is that if it's 7:59, itll be 7:00 
         interval15 = floor_date(ymd_hms(Start.Time), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))
```

Load in 2021 5Y American Community Survey Census Data for Hamilton County. We evaluate the following variables: Total population, median household income, median age, white vs non-white populations, percent public transit commuters, and median commute time. 

```{r get_census, message=FALSE, warning=FALSE, cache=TRUE, results = 'hide'}
chatCensus <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2021, 
          state = 47, 
          county= 065,
          geometry = TRUE, 
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

#extract geometries 

city_limits <- st_read("/Users/LauraFrances_1/Library/CloudStorage/GoogleDrive-lauramuellersoppart@gmail.com/My Drive/Penn/Courses/MUSA508_PPA/PPA_Assignment5/BikeShare/Chattanooga City Limits/geo_export_48f449e6-f033-4fc7-afa3-be73bce74c1d.shp") %>% 
  st_transform(crs=st_crs(chatCensus)) 

city_limits <- st_make_valid(city_limits)
chatCensus <- st_make_valid(chatCensus)

# Perform the intersection
intersected_chatTracts <- st_intersection(chatCensus, city_limits)

# Extract geometries for the intersected tracts
chatTracts <- 
  intersected_chatTracts %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf

#add census tracts 
dat_census <- st_join(dat2 %>% 
          filter(is.na(from_longitude) == FALSE &
                   is.na(from_latitude) == FALSE &
                   is.na(to_latitude) == FALSE &
                   is.na(to_longitude) == FALSE) %>%
          st_as_sf(., coords = c("from_longitude", "from_latitude"), crs = 4326),
        chatTracts %>%
          st_transform(crs=4326),
        join=st_intersects,
              left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(from_longitude = unlist(map(geometry, 1)),
         from_latitude = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)%>%
  st_as_sf(., coords = c("to_longitude", "to_latitude"), crs = 4326) %>%
  st_join(., chatTracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(to_longitude = unlist(map(geometry, 1)),
         to_latitude = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)
```

Load in weather data from Chattanooga Metropolitan Airport's weather station for March 01 to May 31, 2022. Chattanooga has a distinct spring season where by mid-April precipitation falls and temperatures steadily rise. 

```{r message=FALSE, warning=FALSE, cache=TRUE}
weather.Panel <- 
  riem_measures(station = "KCHA", date_start = "2022-03-01", date_end = "2022-05-31") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
  top="Weather Data - Chattanooga KCHA - Spring 2022")


# # Create a trend line for weekly rides to add on top of weather 
# trend_line <- ggplot(dat_census %>%
#                        group_by(week) %>%
#                        tally(),
#                      aes(x = week, y = n)) +
#   geom_line(size = 0.2)

```

## Exploratory Analysis


Bike Chattanooga has 42 stations that are concentrated in the Downtown and North Chattanooga neighborhoods. 

```{r message=FALSE, warning=FALSE, cache=TRUE, fig.width=10}

highways <- opq("Chattanooga, TN") %>% 
  add_osm_feature(key = "highway", value = c("motorway", "motorway_link", "trunk", "primary")) %>% 
  osmdata_sf()

rivers <- opq("Chattanooga, TN") %>% 
  add_osm_feature(key = 'waterway', value = "river") %>%
  osmdata_sf()

ggplot() +
  geom_sf(data = chatTracts %>% st_transform(crs = 4326), fill = "darkgray", color = "gray90") +
  geom_sf(data = highways$osm_lines, color = "gray30", size = 0.3, alpha = 0.5) +
  geom_sf(data = rivers$osm_polygons, color = "lightblue", alpha = 0.5) +
  geom_point(data = dat2, aes(x = from_longitude, y = from_latitude), 
             color = "blue", alpha = 0.7, size = 0.3) +
  labs(title = "Distribution of Bikeshare Stations in Chattanooga",
       subtitle = "March - May 2022 | Dark gray lines are major motorways | Light gray lines are census tracts")+ 
  mapTheme

# #station by rides - tried to show station popularity
# ggplot()+
#   geom_sf(data = chatTracts %>% st_transform(crs = 4326), fill = "darkgray", color = "gray90") +
#   geom_point(data = dat_census %>%
#               group_by(Start.Station.ID) %>%
#               summarize(avg_weekly_rides = mean(n)),
#             aes(x = from_longitude, y = from_latitude, color = avg_weekly_rides, size = avg_weekly_rides),
#             alpha = 1) +
#   # scale_fill_gradient(low = "#D2FBD4", high = "#527D82", na.value = "transparent") +
#   scale_colour_viridis(direction = -1, discrete = FALSE, option = "G")+
#   ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
#   xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
#   labs(title="Avg Weekly Bike share trips by station",
#           subtitle = "March - May 2022 | Chattanooga, TN")+
#   mapTheme
```

### Temporal Analysis 

Examining our bike share data, we notice clear temporal trends: peaks during certain hours of the day and a longer-time decline in trips across the season. There is a spike of activity in early March during the Erlanger Chattanooga Marathon weekend. Generally, weekend days are more popular for riding Bike Chattanooga than weekday days. 

```{r message=FALSE, warning=FALSE}
ggplot(dat_census %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n), size=0.2)+
  labs(title="Bike share trips per hour",
       subtitle = "March - May 2022 | Chattanooga, TN",
       x="Date", 
       y="Number of trips")+
  plotTheme

```

Breaking these trends down to the, we notice peaks at certain times of day and certain days of the week. Overall ridership is higher on the weekend, and peaks a few hours earlier compared to the weekday's after work rush. Generally, afternoon rush hours are associated with the highest ridership volumes during the week.

```{r warning = FALSE, message = FALSE }
ggplot(dat_census %>% mutate(hour = hour(Start.Time)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
    scale_color_manual(values = c("#D74E09", "#bdd7e7","#6baed6","#3182bd","#08519c", "blue", "#6E0E0A"), name = "Day") + 
  labs(title="Bike share trips, by day of the week",
      subtitle = "March - May 2022 | Chattanooga, TN",
       x="Hour", 
       y="Trip Counts")+
     plotTheme


ggplot(dat_census %>% 
         mutate(hour = hour(Start.Time),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  scale_color_manual(values = c("#3182bd", "#D74E09")) +
  labs(title="Bike share trips - weekend vs weekday",
        subtitle = "March - May 2022 | Chattanooga, TN",
       x="Hour", 
       y="Trip Counts")+
     plotTheme
```

```{r warning = FALSE, message = FALSE }
dat_census %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, Start.Station.Name, time_of_day) %>%
         tally()%>%
  group_by(Start.Station.Name, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips, fill=time_of_day), binwidth = 1)+
  scale_fill_manual(values = palette5, name="Time of Day") + 
  labs(title="Mean Number of Hourly Trips Per Station",
       subtitle = "March - May 2022 | Chattanooga, TN",
       x="Number of stations", 
       y="Trips Frequency")+
  facet_wrap(~time_of_day)+
  theme(legend.position = "none") +
  plotTheme
```

```{r include = FALSE }
ggplot(dat_census %>%
         group_by(interval60, Start.Station.Name) %>%
         tally())+
  geom_histogram(aes(n), binwidth = 5, fill="#3182bd")+
  labs(title="Bike share trips per hr by station",
        subtitle = "March - May 2022 | Chattanooga, TN",
       x="Number of Stations", 
       y="Trip Counts")+
  plotTheme
```

### Spatial Analysis

Spatially, we observe that the highest ridership per station clusters in the Downtown and North Chattanooga areas, particularly in the mid-day and afternoon. 

```{r message=FALSE, warning=FALSE }
ggplot()+
  geom_sf(data = chatTracts %>% st_transform(crs = 4326), fill = "darkgray", color = "gray90") +
  geom_point(data = dat_census %>% 
            mutate(hour = hour(Start.Time),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              group_by(Start.Station.ID, from_latitude, from_longitude, weekend, time_of_day) %>%
              tally(),
            aes(x=from_longitude, y = from_latitude, color = n), 
            fill = "transparent", alpha = 1, size = 0.3)+
  # scale_fill_gradient(low = "#D2FBD4", high = "#527D82", na.value = "transparent") +
  scale_colour_viridis(direction = -1, discrete = FALSE, option = "G")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hr by station",
          subtitle = "March - May 2022 | Chattanooga, TN")+
  mapTheme
```



```{r message=FALSE, warning=FALSE}

#mapping demographics in chattanooga to better understand the city's dynamics 

#chattanooga tracts
intersected_chatCensus <- st_intersection(chatCensus, chatTracts)

plot1 <- ggplot() +
  geom_sf(data = chatTracts %>% st_transform(crs = 4326), fill = "darkgray", color = "gray90") +
  geom_sf(data = intersected_chatCensus, aes(fill = Total_Pop), color = "white", size = 0.05) +
  geom_sf(data = highways$osm_lines, color = "gray30", size = 0.3, alpha = 0.5) +
  geom_sf(data = rivers$osm_polygons, color = "lightblue", alpha = 0.5) +
  geom_point(data = dat2, aes(x = from_longitude, y = from_latitude), 
             color = "blue", alpha = 0.7, size = 0.3) +
  scale_fill_gradient(low = "#D2FBD4", high = "#527D82", na.value = "transparent") +
  labs(title = "Tracts by Total Population", subtitle = "2022 | Chattanooga, TN") +
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  theme(legend.position = "bottom") +
  mapTheme

plot2 <- ggplot() +
  geom_sf(data = chatTracts %>% st_transform(crs = 4326), fill = "darkgray", color = "gray90") +
  geom_sf(data = intersected_chatCensus, aes(fill = Med_Inc), color = "white", size = 0.05) +
  geom_sf(data = highways$osm_lines, color = "gray30", size = 0.3, alpha = 0.5) +
  geom_sf(data = rivers$osm_polygons, color = "lightblue", alpha = 0.5) +
  geom_point(data = dat2, aes(x = from_longitude, y = from_latitude), 
             color = "blue", alpha = 0.7, size = 0.3) +
  scale_fill_gradient(low = "#D2FBD4", high = "#527D82", na.value = "transparent") +
  labs(title = "Tracts by Median Income", subtitle = "2022 | Chattanooga, TN") +
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  theme(legend.position = "bottom") +
  mapTheme


plot3 <- ggplot() +
  geom_sf(data = chatTracts %>% st_transform(crs = 4326), fill = "darkgray", color = "gray90") +
  geom_sf(data = intersected_chatCensus %>% mutate(Percent_Non_White = 1 - Percent_White), 
          aes(fill = Percent_Non_White), color = "white", size = 0.05) +
  geom_sf(data = highways$osm_lines, color = "gray30", size = 0.3, alpha = 0.5) +
  geom_sf(data = rivers$osm_polygons, color = "lightblue", alpha = 0.5) +
  geom_point(data = dat2, aes(x = from_longitude, y = from_latitude), 
             color = "blue", alpha = 0.7, size = 0.3) +
  scale_fill_gradient(low = "#D2FBD4", high = "#527D82", na.value = "transparent") +
  labs(title = "Tracts by Percent Non-White", subtitle = "2022 | Chattanooga, TN") +
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  theme(legend.position = "bottom") +
  mapTheme

# # # install.packages("patchwork")
# library(patchwork)
# plot1 + plot2 + plot3
```

## Feature Engineering

To build our model, we conduct feature engineering, exploratory data analysis, and correlation analysis. The primary engineered feature are temporal lags which represent the number of trips in the previous hours. We find that the lags are strongly correlated with our dependent variable, Trip_Count, which provides strong insight into how to re-balance the Bike Chattanooga fleet.

### Time-Space Panel

```{r message = FALSE, warning = FALSE, results='hide'}
length(unique(dat_census$interval60)) * length(unique(dat_census$Start.Station.ID)) #69720 combinations

study.panel <- 
  expand.grid(interval60=unique(dat_census$interval60), #going to populate combinations 
              Start.Station.ID = unique(dat_census$Start.Station.ID)) %>% #summarize by hour and station
  left_join(., dat_census %>% #joins by station_id since its the only unique variable 
              select(Start.Station.ID, Start.Station.Name, Origin.Tract, from_longitude, from_latitude )%>%
              distinct() %>%
              group_by(Start.Station.ID) %>%
              slice(1))

ride.panel <- 
  dat_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, Start.Station.ID, Start.Station.Name, Origin.Tract, from_longitude, from_latitude) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel) %>%
  ungroup() %>%
  filter(is.na(Start.Station.ID) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)

ride.panel <- 
  left_join(ride.panel, chatCensus %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))

## Create Time Logs

#install.packages("spatialreg")
library(spatialreg)

ride.panel <- 
  ride.panel %>% 
  arrange(Start.Station.ID, interval60) %>%  
  mutate(lagHour = dplyr::lag(Trip_Count,1), #from an hour ago
         lag2Hours = dplyr::lag(Trip_Count,2), #from two hours ago
         lag3Hours = dplyr::lag(Trip_Count,3), #from three hours ago, etc. 
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24),
         holiday = ifelse(yday(interval60) == 148,1,0)) %>% #Memorial day - but when I change to Marathon weekend, Day 65, it breaks - not sure why
   mutate(day = yday(interval60)) %>%
   mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                 dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
                                 dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
                                 dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
                                 dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
                                 dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
         holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag)) 


#look at the correlation between the lags
as.data.frame(ride.panel) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                                "lag12Hours","lag1day")))%>%
    group_by(Variable) %>%  
    summarize(correlation = round(cor(Value, Trip_Count),2)) %>%
    kbl(caption = "Correlations between Engineered Features and Dependent Variable") %>%
    kable_paper(bootstrap_options = "striped", full_width = TRUE)
```

To provide a more intuitive visual representation of the bike share trips, we plot an animated map with temporal and spatial context. Additionally, we observe that stations are currently located in wealthier neighborhoods rather than the most populous areas.

```{r animated map, warning = FALSE, message = FALSE}
stations<-
  dat2 %>% 
  dplyr::select(Start.Station.ID, from_longitude, from_latitude) %>%
  group_by(Start.Station.ID) %>%
  distinct()%>%
  filter(is.na(from_longitude) == FALSE &
           is.na(from_latitude) == FALSE) %>%
  st_as_sf(., coords = c("from_longitude", "from_latitude"), crs = 4326)


one_day <-
  dat_census %>%
  filter(week == 14 & dotw == "Thu")

one_day.panel <-
  expand.grid(
    interval60 = unique(one_day$interval60),
    Origin.Tract = unique(ride.panel$Origin.Tract))


animation_data <-
  mutate(one_day, Trip_Counter = 1) %>%
  right_join(one_day) %>% 
  group_by(interval60, Start.Station.ID) %>%
  summarise(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
  ungroup() %>% 
  right_join(stations, by=c("Start.Station.ID" = "Start.Station.ID")) %>% 
  st_sf() %>%
  mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                           Trip_Count == 1 ~ "1 trips",
                           Trip_Count == 2 ~ "2 trips",
                           Trip_Count > 2 & Trip_Count <= 4 ~ "3-4 trips",
                           Trip_Count > 4 ~ "5+ trips")) %>%
  mutate(Trips  = fct_relevel(Trips, "0 trips","1 trips","2 trips",
                              "3-4 trips","5+ trips"))


bike_animation <-
  ggplot() +
  geom_sf(data = chatCensus %>%
            st_transform(crs=4326))+
  geom_sf(data = chatCensus, aes(fill = Med_Inc), color = "white", size = 0.05) +
  geom_sf(data = animation_data, 
             aes(color = Trips), size = 2) +
  scale_color_manual(values = palette4) +
  scale_fill_gradient(low = "gray90", high = "gray30", na.value = "white") +  # Change the color palette
 ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
 xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  labs(title = "Bike Trips by Start Stations for the first week of April, 2022",
       subtitle = "60-minute intervals: {current_frame}") +
  transition_manual(interval60) +
  mapTheme

animate(bike_animation, duration=10, renderer = gifski_renderer())
```

## Spatiotemporal Prediction

To conduct regression analysis and prediction on bike ride data, the data is split into training (ride.Train) and testing (ride.Test) sets based on the week variable. Several linear regression models (reg1 to reg6) are fitted using the ride.Train data and different combinations of predictor variables, including hour of the day, day of the week, temperature, precipitation, lags of trip counts, holiday information, and demographic variables.

When comparing Model 3 (Start.Station.Name + hour(interval60) + dotw + Temperature + Precipitation) and Model 6 (Model 3 + lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday + Total_Pop + Med_Inc + Percent_Taking_Public_Trans), Model 6 has a higher R² (0.227) compared to Model 3 (0.171) which suggests that Model 6 explains a larger proportion of the variance in the dependent variable compared to Model 3. Additionally, Model 6 has a slightly lower residual standard error (1.201) compared to Model 3 (1.243) which indicates a better fit of the model to the data. Lastly, while both models have a significant F statistic, Model 6 (F = 63.822) has a higher value compared to Model 3 (F = 55.888).

```{r message = FALSE, warning = FALSE, results='hide'}
#splits data about in half 
ride.Train <- filter(ride.panel, week >= 20)
ride.Test <- filter(ride.panel, week < 20)

reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)

reg2 <- 
  lm(Trip_Count ~  Start.Station.Name + dotw + Temperature,  data=ride.Train)

reg3 <- 
  lm(Trip_Count ~  Start.Station.Name + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride.Train)

reg4 <- 
  lm(Trip_Count ~  Start.Station.Name +  hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, 
     data=ride.Train)

reg5 <- 
  lm(Trip_Count ~  Start.Station.Name + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday, 
     data=ride.Train)

reg6 <-
  lm(Trip_Count ~  Start.Station.Name + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday + 
                   Total_Pop + Med_Inc + Percent_Taking_Public_Trans, 
     data=ride.Train)

```

```{r message = FALSE, warning = FALSE, results='hide'}
library(stargazer)
stargazer(reg3, reg6, type="text") 
``` 

We defined a function to predict trip counts for a given model and test dataset. For each model (reg1 to reg6), predictions are generated along with observed values, are gathered into a long format, and additional metrics like Absolute Error (AE), Mean Absolute Error (MAE), and standard deviation of AE.

```{r message = FALSE, warning = FALSE, results='hide'}

#Predict for test data 
ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week) 

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)} #predict for fit using the dat dataframe 

week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred), #mapping the function for different models d
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
           ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred),
           FTime_Space_FE_timeLags_holidayLags_demo = map(.x = data, fit = reg6, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

```


# **Discussion**

## Examine Error Metrics for Accuracy

The plot illustrates the Mean Absolute Errors (MAE) for different model specifications across weeks. Notably, Models 4, 5, and 6 include time lags as a feature and consistently display lower MAE. While Model D has a slightly lower MAE, the enhanced statistical performance of Model F may justify its marginally higher MAE. 

```{r plot_errors_by_model }
week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette6) +
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme
```

The plot illustrates the predicted and observed time series of bike share trips for different model specifications across multiple weeks. The discrepancy between observed and predicted values, particularly around peak times, may arise from the inherent complexity of real-world bike share trip patterns. Factors such as external events, unforeseen fluctuations in demand, or unaccounted variables in the models may contribute to the differences. Generally, this comparative analysis shows how well each model aligns with the observed data, providing insights into the predictive performance of different model specifications.

```{r error_vs_actual_timeseries , warning = FALSE, message = FALSE}
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           Start.Station.ID = map(data, pull, Start.Station.ID)) %>%
    dplyr::select(interval60, Start.Station.ID, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -Start.Station.ID) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed bike share time series", subtitle = "March - May 2022 | Chattanooga | A test set of 2 weeks",  x = "Hour", y= "Station Trips") +
      plotTheme
```

Spatially, we observe that there is higher MAE at stations proximate to the Tennessee River.

<div style="text-align:center">
<img src="/Users/LauraFrances_1/Library/CloudStorage/GoogleDrive-lauramuellersoppart@gmail.com/My Drive/Penn/Courses/MUSA508_PPA/PPA_Assignment5/BikeShare/mapview map.png" alt="Alt text" width="75%" align = 'v'>
</div>

```{r errors_by_station, warning = FALSE, message = FALSE, eval=FALSE }
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           Start.Station.ID = map(data, pull, Start.Station.ID), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude)) %>%
    select(interval60, Start.Station.ID, from_longitude, from_latitude, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "FTime_Space_FE_timeLags_holidayLags_demo") %>%
  group_by(Start.Station.ID, from_longitude, from_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = chatCensus, color = "grey", fill = "transparent")+
  geom_sf(data = highways$osm_lines, color = "gray30", size = 0.3, alpha = 0.5) +
  geom_sf(data = rivers$osm_polygons, color = "lightblue", alpha = 0.5) +
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             fill = "transparent", alpha = 0.8)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  labs(title="Mean Abs Error, Test Set, Model F", 
       subtitle = "Time_Space_FE_timeLags_holidayLags_demo")+
  mapTheme


# install.packages("mapview")
# library(mapview)
# 
# # Assuming you have an sf object for chatCensus
# chatCensus_sf <- st_as_sf(chatCensus)
# 
# # Your existing code
# result <- week_predictions %>% 
#   mutate(interval60 = map(data, pull, interval60),
#          Start.Station.ID = map(data, pull, Start.Station.ID), 
#          from_latitude = map(data, pull, from_latitude), 
#          from_longitude = map(data, pull, from_longitude)) %>%
#   select(interval60, Start.Station.ID, from_longitude, from_latitude, Observed, Prediction, Regression) %>%
#   unnest() %>%
#   filter(Regression == "FTime_Space_FE_timeLags_holidayLags_demo") %>%
#   group_by(Start.Station.ID, from_longitude, from_latitude) %>%
#   summarize(MAE = mean(abs(Observed - Prediction), na.rm = TRUE))
# 
# # Creating an sf object for the results
# result_sf <- st_as_sf(result, coords = c("from_longitude", "from_latitude"))
# 
# # Plotting with mapview
# mapview(chatCensus_sf) +
#   mapview(highways$osm_lines, col.regions = "gray30", alpha.regions = 0.5) +
#   mapview(result_sf, zcol = "MAE", alpha = 0.8, legend = TRUE)

```

## Space-Time Error Evaluation


The scatterplots depict the relationship between observed and predicted bike share trips, with each point representing a specific time period and station. While the plots show a generally positive correlation, the models undercount the number of predicted bike trips. 

```{r obs_pred_all, warning=FALSE, message = FALSE, cache=TRUE}
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           Start.Station.ID = map(data, pull, Start.Station.ID), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, Start.Station.ID, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "blue")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme
```


```{r warning=FALSE, message = FALSE, include = FALSE}
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           Start.Station.ID = map(data, pull, Start.Station.ID), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw) ) %>%
    select(interval60, Start.Station.ID, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(Start.Station.ID, weekend, time_of_day, from_longitude, from_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = chatCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             fill = "transparent", size = 0.5, alpha = 0.8)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set")+
  mapTheme
  
```

When comparing the the relationship between Mean Absolute Error (MAE) and three socio-economic variables (Percent_Taking_Public_Trans, Med_Inc, Percent_White) during the "AM Rush" versus the "PM Rush", the "AM Rush" suggests a stronger, negative relationship bewteen errors and those who take public transportation to work. 

```{r station_summary2, warning=FALSE, message = FALSE }
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           Start.Station.ID = map(data, pull, Start.Station.ID), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw),
           Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
           Med_Inc = map(data, pull, Med_Inc),
           Percent_White = map(data, pull, Percent_White)) %>%
    select(interval60, Start.Station.ID, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(Start.Station.ID, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-Start.Station.ID, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  #geom_sf(data = chatCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables - AM RUSH",
       y="Mean Absolute Error (Trips)")+
  plotTheme

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           Start.Station.ID = map(data, pull, Start.Station.ID), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw),
           Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
           Med_Inc = map(data, pull, Med_Inc),
           Percent_White = map(data, pull, Percent_White)) %>%
    select(interval60, Start.Station.ID, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "PM Rush") %>%
  group_by(Start.Station.ID, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-Start.Station.ID, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  #geom_sf(data = chatCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables - PM RUSH",
       y="Mean Absolute Error (Trips)")+
  plotTheme
  
```

# **Results** 

Cross-validation is performed across time, rather than space, to assess the performance and generalizability of a predictive model. The results of the model indicate a Mean Absolute Error (MAE) of approximately 0.646 with a standard deviation of around 0.037. A lower MAE is generally desirable, indicating better predictive performance. The standard deviation of the MAE gives an idea of the variability of prediction errors across different folds of the cross-validation process.

```{r message=FALSE, warning=FALSE}
control <- trainControl(method="cv", number=100)

set.seed(42)

model_cv <- train(Trip_Count ~ Start.Station.ID + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday + 
                   Total_Pop + Med_Inc + Percent_Taking_Public_Trans, 
                  data=na.omit(ride.panel),
                  method="lm",
                  trControl=control)

model_cv$resample %>% 
  summarise(MAE = mean(model_cv$resample[,3]),
            sd(model_cv$resample[,3])
) %>%
  kbl(col.name=c('Mean Absolute Error','Standard Deviation of MAE')) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```


Enhancing the model's performance relies on two main factors: refining feature engineering and increasing computational resources. Incorporating binary variables to account for factors like rush hours may improve accuracy. However, the most significant improvement would likely come from expanding the dataset, enabling the model to discern nuanced patterns, such as variations in commuting behavior on different holidays, such as Marathon Sunday.Additionally, transitioning from ordinary least squares (OLS) regression to models explicitly addressing spatial dependencies, such as spatial lag regression or geographically-weighted regression, particularly around the city's highways and waterways. Ultimately, maximizing the dataset size would likely prove beneficial. Training the model on a more extensive historical dataset, encompassing the approximately 11 years of available Bike Chattanooga data, would likely yield substantial improvements in accuracy and generalizability. 

# **Conclusion** 

Our model offers valuable insights for optimizing Bike-Chattanooga's rebalancing efforts. Its overall accuracy surpasses a naive approach that reacts to emerging demand. However, challenges persist, particularly in handling the non-random spatial distribution of errors around the Tennessee River. Notably, the model struggles to generalize well in high-demand areas, where the consequences of underprediction are most pronounced. While the model remains a robust predictive tool, there are notable opportunities for improvement, especially in enhancing its treatment of spatial fixed effects
